Seeing Voices and Hearing Faces: Cross-modal biometric matching

Authors

  • Arsha Nagrani
  • Samuel Albanie
  • Andrew Zisserman

Abstract

We introduce a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images is the speaker. In this paper we study this, and a number of related cross-modal tasks, aimed at answering the question: how much can we infer from the voice about the face and vice versa? We study this task “in the wild”, employing the datasets that are now publicly available for face recognition from static images (VGGFace) and speaker identification from audio (VoxCeleb). These provide training and testing scenarios for both static and dynamic testing of cross-modal matching. We make the following contributions: (i) we introduce CNN architectures for both binary and multi-way cross-modal face and audio matching; (ii) we compare dynamic testing (where video information is available, but the audio is not from the same video) with static testing (where only a single still image is available); and (iii) we use human testing as a baseline to calibrate the difficulty of the task. We show that a CNN can indeed be trained to solve this task in both the static and dynamic scenarios, and is even well above chance on 10-way classification of the face given the voice. The CNN matches human performance on easy examples (e.g. different gender across faces) but exceeds human performance on more challenging examples (e.g. faces with the same gender, age and nationality).
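
To make the forced-matching setup concrete, here is a minimal sketch of how an N-way voice-to-face decision and its chance level could be computed. It assumes that the voice and each candidate face have already been mapped to fixed-length embedding vectors in a shared space and that cosine similarity is used as the matching score; both are illustrative assumptions, not the CNN architectures introduced in the paper. The function names `cosine`, `match_face` and `chance_accuracy` are hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_face(voice: Sequence[float], faces: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate face most similar to the voice.

    Assumption: voice and face embeddings live in the same space.
    With two candidates this is the binary task; with ten, the 10-way task.
    """
    scores = [cosine(voice, face) for face in faces]
    return max(range(len(faces)), key=scores.__getitem__)


def chance_accuracy(n_candidates: int) -> float:
    """Expected accuracy of uniform random guessing among n candidates."""
    return 1.0 / n_candidates


# Toy, made-up embeddings: the second face points in nearly the same
# direction as the voice, so it is selected.
voice = [1.0, 0.0]
faces = [[0.0, 1.0], [0.9, 0.1]]
predicted = match_face(voice, faces)
```

Under this framing, random guessing gives 50% accuracy on the binary task and 10% on the 10-way task, which is the baseline that "well above chance" refers to.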

Related resources

Perceptual complexity of faces and voices modulates cross-modal behavioral facilitation effects

Joassin et al. (Neuroscience Letters, 2004, 369, 132-137) observed that the recognition of face-voice associations led to an interference effect, i.e. to decreased performances relative to the recognition of faces presented in isolation. In the present experiment, we tested the hypothesis that this interference effect could be due to the fact that voices were more difficult to recognize than fa...

Men's Preferences for Women's Femininity in Dynamic Cross-Modal Stimuli

Men generally prefer feminine women's faces and voices over masculine women's faces and voices, and these cross-modal preferences are positively correlated. Men's preferences for female facial and vocal femininity have typically been investigated independently by presenting soundless still images separately from audio-only vocal recordings. For the first time ever, we presented men with short v...

When audition alters vision: an event-related potential study of the cross-modal interactions between faces and voices.

Ten healthy volunteers took part in this event-related potential (ERP) study aimed at examining the electrophysiological correlates of the cross-modal audio-visual interactions in an identification task. Participants were confronted either to the simultaneous presentation of previously learned faces and voices (audio-visual condition; AV), either to the separate presentation of faces (visual, V...

Fingerprint-face Feature Matching Using Multi-modality System by Agglomerative Multi-clustering

The biometric authentication is an effective alternative for traditional authentication techniques. Because biometric data cannot be easily restored or revoked, it is significant that biometric templates used in biometric applications must be built and stored in a secure way, such that attackers could not be able to falsify biometric data easily even when the templates are negotiated. Researche...

Neural correlates of perceptual narrowing in cross-species face-voice matching

Integrating the multisensory features of talking faces is critical to learning and extracting coherent meaning from social signals. While we know much about the development of these capacities at the behavioral level, we know very little about the underlying neural processes. One prominent behavioral milestone of these capacities is the perceptual narrowing of face–voice matching, whereby young...

Publication date: 2018